Identifying Anomalies in Pfizer Stock Data Using an LSTM Autoencoder

In this notebook, I will use an LSTM autoencoder to identify anomalies in the Pfizer stock price from January 2020 through September 2021.

Coming from a biological background, I am naturally drawn to the analysis of pharmaceutical stock data. Additionally, as the world is currently coming out of the tail end of a pandemic, Pfizer's vaccine has made the company one of the most consequential stocks out there.

The approach is inspired by that of TareqTayeh (1), whose GitHub is linked at the bottom of this document; the Pfizer stock data was obtained from Kaggle (2). I used MachineLearningMastery's tutorials (3, 4) on LSTM autoencoders and hyperparameter tuning. Lastly, I used Yahoo Finance to check my predicted anomalies against Pfizer's stock prices (5).

The approach consists of six main steps:

1) Split and scale the data
2) Create the sequences
3) Build the LSTM autoencoder
4) Train model and run on the test data
5) Detect anomalies
6) Compare predicted anomalies with actual stock price data

Preliminary Code

This dataset contains dates ranging from June 1, 1972 to October 1, 2021. It has five price columns for each date: "Open", "High", "Low", "Close", and "Adj Close". I will use the adjusted closing price, as this quantity accounts for any corporate actions that affect the closing price of the stock, such as stock splits or rights offerings.

Splitting and scaling the data

To split the data, I will create a training set and a test set. The test set consists of stock prices from the beginning of January 2020 to the end of September 2021. For the training set, I aim to choose a subset of the data during which I assume there are almost no anomalies. While there are many ways to construct such a set, I will use the decade of the 2010s, as it constitutes a large chunk of time and immediately precedes the test data.

Just a small note about the code chunk above: due to weekends and holidays, the first trading day of 2010 isn't simply '2010-01-01', and likewise the last trading day of 2019 isn't '2019-12-31'; for this reason, I need to collect the indices corresponding to the actual start and end dates of the training period.
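A minimal sketch of this split-and-scale step, using a synthetic random-walk series as a stand-in for the Kaggle CSV (the column name "Adj Close" matches the dataset; the date ranges match the text). Note that pandas' label-based `.loc` slicing on a `DatetimeIndex` sidesteps the weekend/holiday issue by picking the first and last trading days inside each range:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kaggle file: business days with a
# random-walk "Adj Close" column.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2005-01-01", "2021-09-30")
df = pd.DataFrame(
    {"Adj Close": 30 + rng.normal(0, 0.5, len(dates)).cumsum()},
    index=dates,
)

# Label-based slicing returns the rows between the first and last
# trading days that fall inside each range.
train = df.loc["2010-01-01":"2019-12-31"]
test = df.loc["2020-01-01":"2021-09-30"]

# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test period leaks into the scaling.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["Adj Close"]])
test_scaled = scaler.transform(test[["Adj Close"]])
```

Fitting the scaler on the training set alone is the standard way to avoid look-ahead leakage when the test period must remain unseen.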

Creating the time-series sequences

The input fed into the LSTM model consists of sequences. The sequences have length t, where t is the time step that is walked forward. For this model, the time step is set to 30, meaning that the sequences fed into the model correspond to roughly month-long windows of trading days.

Just a note about the code chunk above: for the LSTM model, the sequences must be stacked into a 3D tensor of shape (number of sequences, time step, 1).
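A sliding-window helper like the one described might look as follows (the function name and the stand-in price array are my own; only the window length of 30 and the 3D output shape come from the text):

```python
import numpy as np

TIME_STEP = 30  # sequence length t: one window per ~month of trading days

def create_sequences(values: np.ndarray, time_step: int = TIME_STEP) -> np.ndarray:
    """Slide a window of length `time_step` over a 1-D array of scaled
    prices and stack the windows into a tensor of shape
    (n_sequences, time_step, 1), as the LSTM expects."""
    windows = [values[i:i + time_step]
               for i in range(len(values) - time_step + 1)]
    return np.array(windows).reshape(-1, time_step, 1)

prices = np.linspace(0.0, 1.0, 100)  # stand-in for the scaled training prices
X_train = create_sequences(prices)
print(X_train.shape)  # (71, 30, 1)
```

For an input of length n, this yields n - t + 1 overlapping windows, each advanced by one trading day.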

Building the model

I will build a model similar to that of TareqTayeh (1), referenced earlier. In parallel with this notebook, I wrote a script using scikit-learn's GridSearchCV to optimize the hyperparameters of the LSTM model: the number of epochs, the batch size, and the dropout rate. The script and its output are uploaded in the repository for this project. For the autoencoder, I will use the parameters output by GridSearchCV.
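As a sketch of what such an LSTM autoencoder might look like in Keras (the layer width of 64 and the dropout rate of 0.2 are placeholder assumptions here; the actual values come from the GridSearchCV run described above):

```python
from tensorflow import keras
from tensorflow.keras import layers

TIME_STEP = 30

# The encoder LSTM compresses each 30-step sequence into one latent
# vector; RepeatVector copies that vector back out to 30 steps so the
# decoder LSTM can reconstruct the original sequence step by step.
model = keras.Sequential([
    layers.Input(shape=(TIME_STEP, 1)),
    layers.LSTM(64, return_sequences=False),   # encoder
    layers.Dropout(0.2),                       # placeholder dropout rate
    layers.RepeatVector(TIME_STEP),
    layers.LSTM(64, return_sequences=True),    # decoder
    layers.Dropout(0.2),
    layers.TimeDistributed(layers.Dense(1)),   # one reconstructed value per step
])
model.compile(optimizer="adam", loss="mae")
```

Training would then fit the model to reproduce its own input, e.g. `model.fit(X_train, X_train, epochs=..., batch_size=...)`, with epochs and batch size taken from the grid search.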

Training the model

Detecting anomalies

To detect anomalies, I will do the following:

(1) Subtract the predicted stock prices for the training data from the actual stock prices to find loss

(2) Derive an anomaly threshold value from the training data for the reconstruction loss

(3) Subtract the predicted stock prices for the test data from the actual stock prices

(4) Isolate all dates from the test data that have a loss greater than or equal to the previously derived threshold value

I have set a very strict threshold by using the 99th percentile of the training reconstruction loss. My aim is not to find as many anomalies as possible but simply to identify true anomalies; a strict threshold increases the chance that the flagged anomalies actually are anomalies.
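The four steps above reduce to a few lines of NumPy. In this sketch the per-sequence losses are synthetic stand-ins; in the notebook they would come from the absolute difference between the model's reconstructions and the actual sequences:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for steps (1) and (3): per-sequence MAE reconstruction
# losses on the training and test data.
train_loss = np.abs(rng.normal(0.0, 0.1, 2500))
test_loss = np.abs(rng.normal(0.0, 0.2, 440))

# Step (2): the anomaly threshold is the 99th percentile of the
# training reconstruction loss.
threshold = np.percentile(train_loss, 99)

# Step (4): flag test sequences whose loss meets or exceeds it.
anomalies = test_loss >= threshold
print(f"threshold={threshold:.4f}, flagged={int(anomalies.sum())}")
```

By construction, only about 1% of the training sequences exceed this threshold, which is what makes it a strict cutoff.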

The model has detected 80 dates as anomalies. Many of these dates are adjacent in time, meaning that the detected anomalies actually span periods of time. The next code chunk prints the start and end dates of these periods.
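Grouping adjacent anomalous dates into periods could be sketched like this (the example dates are hypothetical; the 4-day gap rule is my assumption, chosen so that trading days separated only by a weekend still count as adjacent):

```python
import pandas as pd

# Hypothetical anomalous trading days flagged by the model.
anomaly_dates = pd.to_datetime([
    "2020-03-12", "2020-03-13", "2020-03-16",
    "2020-12-09", "2020-12-10",
    "2021-08-23",
])

dates = pd.Series(sorted(anomaly_dates))
# Start a new group whenever the gap to the previous date exceeds
# 4 days -- enough to bridge a weekend between adjacent trading days.
group_id = (dates.diff() > pd.Timedelta(days=4)).cumsum()
periods = dates.groupby(group_id).agg(["min", "max"])
for _, (start, end) in periods.iterrows():
    print(start.date(), "to", end.date())
```

Applied to the six dates above, this collapses them into three periods, each printed as a start/end pair.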

Comparing the predicted anomalies with the real-world stock data

I will use Yahoo Finance's data on the Pfizer stock to check the accuracy of the predicted anomalies. In each of the charts displayed below, the purple line represents the 30-day moving average, which was selected because the time step for the model was set to 30. The bars at the bottom indicate the volume traded each day, with green and red indicating increases and decreases in stock price, respectively.

The periods of time highlighted in light red represent the anomalies detected by the LSTM model. The dates highlighted in light blue indicate days with high volume. The idea behind taking notice of the volume bars is that large volumes can signify high buying or selling pressure, which drives large increases and decreases in stock price (respectively) and causes anomalies in the stock price as a result. For the purposes of this notebook, I will treat large volume changes as indicative of anomalies in order to qualitatively assess the performance of the model.

Jan to April 2020 (Jan_Apr_2020-2.png)

May to August 2020 (May_August_2020.png)

September to December 2020 (Sept_Dec_2020.png)

Jan to April 2021 (Jan_Apr_2021.png)

May to September 2021 (May_Sept_2021.png)

It can be seen from the charts above that most of the time periods identified by the model as anomalies do contain large changes in trading volume. An exception to this is the period from 12/09/20 to 12/14/20. Throughout 2020 and 2021, the autoencoder also misses some large changes in trading volume; some of the missed anomalies have trading volumes that exceed those of the anomalies the model did detect (e.g., Sep-Dec 2020). Additionally, it is important to note that while the model does correctly identify periods in which anomalies occur, not every day within such a period is an anomaly. All in all, the model shows mediocre performance in this context. It may be useful for identifying periods during which anomalies occur, but it is not effective at flagging every anomaly, or the largest anomalies, within a period. More investigation is required to make this LSTM model more effective.

References

1. https://github.com/TareqTayeh/Price-TimeSeries-Anomaly-Detection-with-LSTM-Autoencoders-Keras/blob/master/code/Time%20Series%20of%20Price%20Anomaly%20Detection%20with%20LSTM%20Autoencoders%20(Keras).ipynb

2. https://www.kaggle.com/varpit94/pfizer-stock-data

3. https://machinelearningmastery.com/lstm-autoencoders/

4. https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/

5. https://finance.yahoo.com/quote/PFE?p=PFE&.tsrc=fin-srch